Abstract


This thesis describes a lexical study of phoneme collocational constraints using a 
metric motivated by information theory. Phonologists have long described the
permissible combinations of phonemes in the form of phonotactic rules. They have
shown that these rules often can be expressed in terms of phoneme equivalence classes.
Thus, for example, the homorganic rule for American English states that a syllable-final
nasal-stop pair must agree in place of articulation. Over the past decade, there
have also been many lexical studies examining the constraining power of phoneme 
equivalence classes, demonstrating their utility for lexical access. While there are 
good reasons to express these constraints using classes well motivated by theory, the 
phoneme space clearly can be partitioned in many other ways. It is conceivable that, 
by allowing phonemes to form various sets of equivalence classes and quantifying the 
constraining power for each set, we may discover phoneme classes that will provide 
the strongest constraints for lexical access. Our line of investigation is inspired by 
recent work on word equivalence classes by Jelinek and word collocational constraints 
by Church. 

Specifically, we investigated phoneme collocational constraints using a normalized 
measure of mutual information. A pair-wise, hierarchical clustering technique is used 
to combine phonemes into classes using this metric. The result of this clustering 
procedure can be displayed as a dendrogram, from which an arbitrary number of 
equivalence classes can be selected. 

We have conducted a number of experiments investigating the collocation con- 
straints of phoneme pairs and triplets. We found that in many cases phonemes are 
organized into classes that share certain phonological features. In fact, phonemes 
that have similar acoustic properties often exhibit similar collocational constraints. 
We also compared the constraining power of our phoneme classes with those chosen 
with a phonological criterion, and found ours to be more than competitive. Based 
on our results, we conclude that our information theoretic metric is particularly well 
suited to a description of lexical constraining power. We discuss the implications of
these results for automatic speech recognition.
Chapter 1


Introduction 


This thesis explores phoneme collocational constraints, their use in discovering phono- 
logical equivalence classes, and their application for lexical access. Specifically, we 
employ an information theoretic metric over the collocational constraints to form a 
phoneme hierarchy. In this chapter we review how previous studies have investigated 
phonemic constraints. We will then discuss both the empirical and experimental bases
which motivate our study.


1.1 Units and their Organization 


Spoken language is composed of units of varying scales. Phonemes constitute the 
finite set of contrastive sounds in a language. The syllable, though controversial, 
seems involved in the mental representation of language. Morphemes are the smallest 
unit of meaning. Words, phrases, sentences, and discourse convey more detailed 
expressions. Units at a particular scale are combined to form the units of larger 
scales. 

These units cannot be combined haphazardly. Language provides constraints on 
how the units can be assembled to form valid structures. The constraints can be 
applied to improve the performance of machine-based speech recognition by reducing 
the difficulty of the task. To do so it is important to understand how the constraints 
can be represented and how much constraining power they offer. 

Constraints at different levels have been studied with varying degrees of thorough- 
ness. Word level constraints, which govern ordering of words into larger units, have 
been particularly well-explored. Syntactic constraints have been codified in the form
of grammars. Semantic constraints, for example restrictions between verbs and their 
objects, are beginning to be understood. Constraining power can be estimated by 
established metrics like perplexity. 

Our understanding of constraints at other levels is comparatively primitive and, 
to a large extent, anecdotal. In particular, the organization of phonemes into words 
is still poorly understood although its importance is unquestioned. How should these 
constraints be expressed? Can we quantify the applicability of a particular constraint? 
Is there a measure of correctness or excellence we can apply to evaluate discordant
constraints?


1.2 Linguistic Description of Phonemic Constraints 


Linguists have developed phonotactic rules to describe phoneme collocational
constraints, but these rules are usually arrived at through enumeration and introspection.
They are based primarily on phoneme environments, but factors from other levels can
be incorporated into the constraints as well. For example, the homorganic nasal-stop
rule states that a nasal followed by a stop within a syllable must be one of the
following pairs:

/mp/  /nt/  /ŋk/

/mb/  /nd/  /ŋg/

Thus we have words like /læmp/ and /lend/ but not */lænp/. This restriction is
not enforced across syllable boundaries, as in /ɪnklud/. Knowledge of this constraint
can be used to aid lexical access in a speech recognition system by helping to anchor 
word or syllable boundaries at non-homorganic pairs. 
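
As an illustration only, the place-agreement reading of this rule can be sketched in Python. The place-of-articulation table, the symbol choices, and the function names below are assumptions made for this sketch; they cover only the three nasals and six stops named in the rule.

```python
# Illustrative place-of-articulation table (an assumption for this sketch);
# it covers only the nasals and stops named in the homorganic rule.
PLACE = {
    "m": "labial", "p": "labial", "b": "labial",
    "n": "alveolar", "t": "alveolar", "d": "alveolar",
    "ŋ": "velar", "k": "velar", "g": "velar",
}
NASALS = {"m", "n", "ŋ"}
STOPS = {"p", "b", "t", "d", "k", "g"}


def is_homorganic(nasal, stop):
    """Return True if a nasal-stop pair agrees in place of articulation."""
    if nasal not in NASALS or stop not in STOPS:
        raise ValueError("expected a nasal followed by a stop")
    return PLACE[nasal] == PLACE[stop]


def boundary_candidates(phonemes):
    """Indices i where phonemes[i], phonemes[i + 1] is a non-homorganic
    nasal-stop pair, i.e. a place where a boundary could be anchored."""
    return [
        i
        for i, (a, b) in enumerate(zip(phonemes, phonemes[1:]))
        if a in NASALS and b in STOPS and not is_homorganic(a, b)
    ]
```

Under these assumptions, a hypothetical phoneme string containing /n/ followed by /k/ yields one candidate boundary position, between the nasal and the stop.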

While these rules can be specified by enumeration of allowable sequences, they can 
be more compactly described using phonological properties, often properties suggested 
by linguists for other purposes. In the example above, we can say that the nasal and 
stop must have the same place of articulation. The fact that these constraints can be 
described in terms of properties suggests that phonemes may be organized into more
than a flat structure.

One possible structuring mechanism is based on distinctive features [1]. These are
a set of phonetic properties which describe a sound’s manner of production and place 
of articulation. Sagey [2] proposes a structure in which the features are embedded 
in a hierarchy universal across languages. Because features can be used to partition 
the phoneme space into equivalence classes, it is implied that phonemes too can be 
arranged into a universal hierarchy. 

Stevens [3] discusses how one might use distinctive features for lexical access. He 
argues that the acoustic correlates of distinctive features are more robust than those 
of a particular allophone. Features provide a compact manner of representing allo- 
phonic variation and can specify both abrupt and gradual articulator movements. 
Furthermore, the redundancy inherent in features suggests an underspecified repre- 
sentation at the lexical level. Regardless of the mechanical benefits afforded by a 
feature-based lexical access strategy, we should ask how much lexical constraining
power they provide.


1.3 Previous Computational Studies 


Because spoken language has always had to contend with communication through 
a noisy channel, we expect that it has evolved to include mechanisms to enhance 
robustness. It is these mechanisms we wish to discover and exploit. 

Researchers have tried to quantify the constraining power of phonotactics. In 
doing so, they hope to find a set of classes which avoids the need to make fine phonetic 
distinction yet captures much of the constraining power inherent in the lexicon. Such 
broad classes are presumably more robust and easier to detect than the phonemes 
which they comprise. 

Shipman and Zue [4] performed studies showing that even a broadly characterized 
phoneme string provides substantial constraints for lexical access of isolated words. 
In some cases, the constraint is sufficient to identify the word without finer analysis. 
They categorized each phoneme into one of six manner classes and used the classes 
to map a lexicon into cohorts containing words with the same broad class patterns. 
To measure the efficiency of the broad classes they computed various statistics on the 
cohorts’ sizes. Huttenlocher [5] refined the study by incorporating a better metric and 
exploring the effects of lexical stress. He also showed how acoustic detectors could
be constructed for the broad classes, an idea more fully developed by Fissore, et al.
[6], as part of a complete speech recognition system. However, none of these studies
explored the effects of varying the broad classes and there was no experimental basis 
for the particular classes they chose. 

The lexical metrics used in these studies are questioned by Carter [7]. He ar- 
gues that a good lexical access metric should use a logarithmic scale, as it better 
characterizes the amount of additional work needed to identify a word. Accordingly 
he suggests using word entropy over a lexicon’s broad-class cohorts to measure the 
constraining power provided. 

Vernooij, et al. [8] further refine this work by noting that after broad classification 
and lexical access we still need to perform finer phonetic classification to identify a 
word in a cohort. Again we would like to make as broad a categorization as possible to 
avoid classification errors. They use a constrained clustering technique to determine 
the best intermediate classes between their five broad classes and the phonemes. 
Their metric is based on the number of words uniquely identified by a proposed set
of classes.


1.4 Collocational Constraints in Language 


Previous computational studies relied on the introspection of researchers for the broad 
classification used. Clearly, there are many ways phonemes can be partitioned into 
equivalence classes. We would like to determine if some reasonable set of broad 
classes, reasonable in terms of existing linguistic theories and what might be detected 
acoustically, can be derived in a data-driven manner. If so, we will have a powerful 
confirmation that our intuition is right. If not, we can at least use the results to gauge 
the relative constraining power of some other broad class set. 

Work by Shannon [9] has shown that there are strong constraints on letter sequences
which can be applied to encode text efficiently. He demonstrated this by
applying an information-theoretic metric to letter strings. As the length of the string
increases, the uncertainty of the following letter decreases. Unfortunately, longer
letter sequences also capture more idiosyncrasies of the text studied and so are less
applicable to other texts.

Studies by Jelinek [10] have shown that an information theoretic metric can be 
used to determine word categories from corpora. By combining words which occur 
in similar contexts, classes embodying both syntactic and semantic information are 
formed. These language structures are captured automatically without resorting to 
the incorporation of syntax or semantics in the metric. Church, et al., [11] perform 
similar computations but relax the word ordering requirement. This compensates for 
noise in the form of inserted words. 

Our goal is to apply similar techniques to pronunciations in order to find phoneme 
classes. In doing so, we will provide a more complete analysis of phoneme classes than 
is present in other studies. We are encouraged by the success of word-level studies 
as they capture linguistic information without explicitly requiring it. We expect to
utilize the analogous structures at the phoneme level.


1.5 Thesis Overview 


In this thesis, we will demonstrate a technique for capturing phoneme collocational 
constraints by classifying phonemes into a hierarchy. The approach we take is data- 
driven and relies on large lexicons to represent the language. By using an information- 
theoretic metric we will provide results which are meaningful and match the lexical 
access task’s complexity. We evaluate these classes against classes suggested by other 
studies using measures motivated by the lexical access task. We propose a new 
evaluation metric which may be better suited to recognition systems, particularly 
continuous speech systems, than previous measures. 

The remainder of this thesis is organized as follows. In chapter 2, we outline the issues
important to our study and give our philosophy for addressing them. Chapter 3
shows how we automatically derive phoneme equivalence classes using large lexica,
and provides comparison using historical measures. Finally, in chapter 4 we discuss
possible extensions of this work.

Chapter 2


Approach 


In this chapter we will outline our philosophy and methods of examining phoneme 
collocational constraints. We begin by looking at the general requirements and how we 
can avoid the weaknesses in previous work. Next, we provide the information theoretic 
measures used by this study and explain the issues involved in using them on phoneme 
strings. We then discuss the clustering technique and lexicon preparation. Finally, we
give an example of how we can use collocational constraints to form phoneme classes.


2.1 General Considerations 


We believe that a major flaw of previous phoneme and lexical studies was their
preconceived notions as to how phonemes should be classified. There are proposed
linguistic and acoustic classification schemes, but they may not entirely address the
needs of lexical access specifically.


Vernooij, et al., demonstrated the use of a self-organizing technique to develop a hierarchy
spanning from broad classes to phonemes that is oriented to speech recognition
systems. However, they did not carry the experiments to their logical conclusion. 
Their technique can be used to organize all phonemes into a single hierarchy, a hi- 
erarchy free of the somewhat arbitrary broad classes. Then the entire hierarchy, 
including a new set of broad classes, will be “optimized” for lexical access. We would 
expect a system using such a hierarchy to deliver better lexical access performance 
than one in which part of the hierarchy is selected according to other criteria. 
2.1.1 Use a Minimum of Preconceptions


For lexical experiments we must choose a particular representation for pronunciations; 
in particular we must specify a phoneme inventory. A good metric should treat the 
resulting data symbolically and make no further assumptions about what sounds the 
symbols represent or how to organize them. Although we know linguistic units larger 
than the phonemes play a role in phonotactic constraints, we do not fully understand 
this mechanism. For both of these reasons, we place no additional constraints on 
the organization of phonemes. We make no claim as to the acoustic similarity of the 
phonemes in a cluster, nor do we require the cluster to conform to an external set of
linguistic criteria.


2.1.2 Maximize Data Utilization 


Phonotactics provide powerful constraints on sound patterns. For example, we know 
that there are severe restrictions on the permissible consonant clusters in English. 
A good metric should be able to exploit these constraints to avoid relying on other 
sources. A metric based on word comparisons does not adequately do this. 

Consider two words, one of the form CVCC and the other CCVC where C represents
a consonant and V represents a vowel. A metric which uses strict string
comparison compares the nth characters of the two strings. In this example we compare
vowels against consonants while a better method might compare the clusters
and vowels separately. To remedy this we could try performing an alignment of sorts 
between the pronunciations. It is not clear how to carry out such an alignment. 
Furthermore, the alignment process would inherently impart a bias to the results. 

There is an additional problem with this measure: it partitions the lexicon based 
on the number of phonemes in each pronunciation. Thus it only compares a word 
against words with a like number of phonemes. This can result in sparse data problems 
and may make us miss important comparisons. It also implies that our lexical access 
strategy needs to compare only words of equal pronunciation length. This is only 
true if our recognition system always hypothesizes the correct number of phonemes. 

Instead we would prefer a metric which compares a word against all others. A 
metric based on phoneme collocational data can do this because it represents all 
pronunciations in an abstract form. By using such a metric we in turn hope to
develop a phoneme hierarchy which reflects the phoneme collocational constraints.


2.1.3 Apply Information-Theoretic Measures 


Other studies have demonstrated the power of applying information measures in other 
language related domains. Information measures have been shown capable of captur- 
ing word classes by combining relevant contexts. By using them properly, we should 
be able to extract phoneme classes by exploiting phonotactic constraints. We can do 
so without explicitly having to specify the nature of these constraints. 

Another benefit of information measures is that they are based on a solid mathe- 
matical foundation. Moreover, the value produced by an information measure usually 
is easy to interpret. 

We can use mutual information to specify a metric which does not partition a 
lexicon. To do so we will compare phonemes’ contexts rather than rely on word level 
comparisons. Thus the measure can employ notions of direction and locality which
are pertinent to speech recognition.


2.1.4 Relevance to Lexical Access 


While we are interested in the linguistic interpretation of our results, we are more 
directed by the needs of lexical access for continuous speech recognition. While we 
do not want to tie ourselves to any particular recognizer or recognition model, we 
do keep in mind the general kinds of search and discrimination which any recognizer 
must make. 

To make practical use of broad phoneme classes within a recognizer, the classes 
must be acoustically detectable and should be robust. Presumably this means that 
the classes consist of phonemes which are acoustically similar and the robustness 
stems from avoiding making fine distinctions between them. 

This study makes no use of acoustic data in forming classes. Doing so allows us to 
determine if the collocational constraints inherently embody acoustic similarities. It 
also avoids issues of spectral representation and comparison vital to acoustic distance
measures.

2.2 Metric Overview


We based the measures used in our studies on information theory. Information theory
is concerned primarily with the probabilistic analysis of communications systems and
so seems well-suited to speech. Unlike typical information theory problems, we are
not concerned with robust transmission. Instead, we are interested in the ability to
predict a phoneme based on its context.


2.2.1 Metric Development 


We based our metric for forming phoneme classes on average mutual information [12],
defined as:

\[ I(X;Y) = \sum_{x \in X} \sum_{y \in Y} P(x,y) \log_2 \frac{P(x,y)}{P(x)\,P(y)} \]

It is the expected value of the mutual information, the amount of information about
the event x provided by the event y. Note that the measure is reversible, making it
equally correct to say that this measures the amount of information about y provided
by x. An alternative description of mutual information says that it compares the probability
that x and y co-occur, P(x,y), against the probability of them co-occurring
were they statistically independent, P(x) P(y).

When P(x,y) = 0, we have an apparent problem, as the logarithm of 0 is undefined.
We note that $\lim_{x \to 0} x \log x = 0$, and so substitute 0 for the computation in these
instances.
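
As an illustration, the definition above can be sketched in Python. This is a minimal sketch rather than the code used in this study; the function name and the toy observations are assumptions introduced here.

```python
from collections import Counter
from math import log2


def average_mutual_information(pairs):
    """Average mutual information I(X;Y), in bits, estimated from a list
    of observed (x, y) pairs using relative frequencies."""
    joint = Counter(pairs)
    total = sum(joint.values())
    left, right = Counter(), Counter()
    for (x, y), count in joint.items():
        left[x] += count
        right[y] += count
    info = 0.0
    # Pairs that never occur are absent from `joint`, so they contribute 0,
    # which matches the convention of substituting 0 when P(x,y) = 0.
    for (x, y), count in joint.items():
        p_xy = count / total
        info += p_xy * log2(p_xy / ((left[x] / total) * (right[y] / total)))
    return info


# Toy data (assumed): y is fully determined by x, so I(X;Y) = H(X) = 1 bit.
dependent = [("a", "b"), ("c", "d")] * 50
# Toy data (assumed): x and y are independent, so I(X;Y) = 0 bits.
independent = [("a", "b"), ("a", "d"), ("c", "b"), ("c", "d")] * 25
```

The two toy data sets mark the extremes: full dependence gives the entropy of X, and independence gives zero.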


When we use the same set for both arguments of the average mutual information,
the formula reduces to:

\[ I(X;X) = H(X) = -\sum_{x \in X} P(x) \log_2 P(x). \]

This is known as the entropy. It is the expected value of the self-information of the
event x, $-\log_2 P(x)$. The self-information can be interpreted as the number of bits
of information needed to specify the event x. Thus the entropy is the mean amount
of information needed to specify events in X.
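
A small Python sketch may make the interpretation in bits concrete. The function name and the two symbol samples are assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Entropy H(X), in bits, estimated from a sequence of observed symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Four equally likely symbols need 2 bits each on average.
uniform_four = ["p", "t", "k", "s"] * 10
# A skewed distribution (3/4 versus 1/4) needs less than 1 bit on average.
skewed = ["p"] * 30 + ["t"] * 10
```

The skewed sample shows that the more predictable the symbol, the less information is needed to specify it.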

We apply these measures by letting X and Y represent a set of phonemes. In
practice, the probabilities are estimated by the formula:

\[ \hat{P}(x) = \frac{|x|}{|X|} \]

where |x| denotes the number of occurrences of x within some set X and |X| denotes
the size of the set.
To use mutual information to capture collocational constraints we choose X and
Y to represent sequentially related phonemes. We use subscripts to indicate relative
positions. For example, we denote the average mutual information between adjacent
phonemes using the equation:

\[ I(X_i; X_{i+1}) = \sum_{x_i} \sum_{x_{i+1}} P(x_i, x_{i+1}) \log_2 \frac{P(x_i, x_{i+1})}{P(x_i)\,P(x_{i+1})} \]

By generalizing this we can define a mutual information measure between any sequence
of phonemes and another.
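
The adjacent-phoneme measure can be sketched as follows. This is a hedged illustration, not the original implementation: the toy pronunciations are written one character per phoneme, the marginals are taken from the pair table itself, and all names are assumptions.

```python
from collections import Counter
from math import log2


def adjacent_pair_counts(lexicon):
    """Count adjacent phoneme pairs within each pronunciation.
    No pairs are formed across word boundaries."""
    counts = Counter()
    for pron in lexicon:
        counts.update(zip(pron, pron[1:]))
    return counts


def mutual_information(pair_counts):
    """I(X_i; X_{i+1}) in bits from a table of (first, second) -> count."""
    total = sum(pair_counts.values())
    first, second = Counter(), Counter()
    for (a, b), c in pair_counts.items():
        first[a] += c
        second[b] += c
    # P(a,b) / (P(a) P(b)) simplifies to c * total / (first[a] * second[b]).
    return sum(
        (c / total) * log2(c * total / (first[a] * second[b]))
        for (a, b), c in pair_counts.items()
    )


# Toy lexicon (assumed): each string is a pronunciation, one symbol per phoneme.
toy_lexicon = ["lai", "lail", "lal", "lalipap"]
counts = adjacent_pair_counts(toy_lexicon)
```

Extending the sketch to longer sequences would amount to replacing the pair keys with longer tuples.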
Finally we want to map the phonemes into broad classes. We use Φ(x) to represent
the class into which phoneme x is mapped. As an example, we can measure the
average mutual information between a phoneme and the phoneme class following it
by:

\[ I(X_i; \Phi(X_{i+1})) = \sum_{x_i} \sum_{\Phi(x_{i+1})} P(x_i, \Phi(x_{i+1})) \log_2 \frac{P(x_i, \Phi(x_{i+1}))}{P(x_i)\,P(\Phi(x_{i+1}))} \]


Our measures presume there is at least one phoneme position being mapped into 
classes and it is from that phoneme’s position that we measure relative distances. We 
may map additional phoneme positions as well. 

There are many ways we can describe the context of a phoneme, each with a com- 
panion information measure. The simplest case, no context, corresponds to phoneme 
entropy. Next simplest is to use a single phoneme or broad class to the left or right. 
We can use a sequence of 2 (or more) phonemes or use some combination of left 
and right contexts. Finally, we can relax the ordering constraints and consider co-occurrence
of phonemes within a window.


2.2.2 Normalization 


The value of a mutual information measure is determined by the a priori probabilities 
of the events. We need to normalize the results in order to make them more meaningful 
across different data sets. To do so we compute the percent information extracted by
a phoneme class mapping, for example:

\[ \mathrm{PIE} = \frac{I(X_i; \Phi(X_{i+1}))}{I(X_i; X_{i+1})} \]

This is the ratio of the information measured with the mapping to the information
measured without it. At best, this measure will be 100%, since the mapping can
only reduce the information available.
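
A minimal Python sketch of this normalization follows, assuming that only the following phoneme is mapped into classes. The toy pair table, the class labels, and the function names are illustrative assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(pair_counts):
    """Average mutual information in bits from (first, second) -> count."""
    total = sum(pair_counts.values())
    first, second = Counter(), Counter()
    for (a, b), c in pair_counts.items():
        first[a] += c
        second[b] += c
    return sum(
        (c / total) * log2(c * total / (first[a] * second[b]))
        for (a, b), c in pair_counts.items()
    )


def percent_information_extracted(pair_counts, class_of):
    """PIE in percent when the following phoneme is replaced by its class.
    Symbols missing from `class_of` are left unmapped."""
    mapped = Counter()
    for (a, b), c in pair_counts.items():
        mapped[(a, class_of.get(b, b))] += c
    return 100.0 * mutual_information(mapped) / mutual_information(pair_counts)


# Toy pair table (assumed), built from pronunciations with one symbol per phoneme.
pair_counts = Counter()
for pron in ["lai", "lail", "lal", "lalipap"]:
    pair_counts.update(zip(pron, pron[1:]))

identity = {}  # no phoneme is merged
vowel_consonant = {"a": "V", "i": "V", "l": "C", "p": "C"}
single_class = {s: "ALL" for s in "ailp"}
```

With these assumed inputs, the identity mapping retains all of the information, the single-class mapping retains none, and any intermediate mapping falls between the two.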

This PIE is different from the one used in Carter [7]. Both can be thought of as 
a two-step process. First we compute an information measure before and after some 
distortion. Then we compute the ratio of the two results. Carter’s PIE is based on
word entropy while all of our measures are based on phoneme comparisons.


2.3 Pronunciation Constructs


We base our studies on the pronunciation of words. We will next justify the lexicographic
representation used.


2.3.1 Validity of Using Phonemes 


We feel that phonemes are a reasonable unit to analyze as their psycholinguistic basis 
has been demonstrated. We choose phonemes over allophones because the phonemic 
inventory is better agreed upon. 

Using phonemes implies that lexical pronunciations are sequences of segments. It 
is often difficult to find robust segment boundaries in an acoustic signal. Avoiding this 
dependence would require our using some more controversial lexical representation. 
We recognize this limitation, but also note that this may be surmountable for a 
particular recognition system by using data which captures segmentation difficulties. 
With it we can construct a lexicon which provides alternate word pronunciations, 
each with a varying number of phonemes. By comparing the analysis of this lexicon 
and the original, we can understand the effects of segmentation accuracy on lexical
access.

2.3.2 Syllables are Too Controversial


Syllables have been shown to provide strong constraints on the organization of sounds 
into words [13]. For example, they constrain the basic CV structures allowed. Syllabic 
structure also has strong influences on acoustic realization, especially with regard to 
prosody and reduced syllables. 

In order to make explicit use of syllable structure we need to mark that structure. 
How should we do this? The simplest approach would be to mark syllable boundaries, 
perhaps including lexical stress. A better method might parse syllables into their 
constituent structure. 

There are two problems with these explicit markers. First, the correct syllable 
structure, if there is one, is not always clear. This problem manifests itself as ambi- 
syllabic phonemes, variations in stress patterns, or even the validity of a more complex 
structure. Second, were we to settle on a particular syllable structure we must also 
choose the means of representing it. A logical approach would be to expand the 
phoneme symbol set to represent each phoneme and its syllabic position. This could 
result in a very large number of complex symbols, thus making the results difficult to 
interpret. 

We chose to circumvent these difficulties by avoiding explicit syllable markers and 
instead utilizing the structure implicitly. By using the mutual information measure 
on a suitably small phoneme sequence, we should be able to capture many constraints 
imposed by syllable constituents. Longer sequences could conceivably cover long-distance
intrasyllable constraints such as those between the onset and coda.

As an added benefit, by not breaking the word into syllables we will be able to see 
cross-syllable and cross-morpheme effects. We know that many phoneme collocational
constraints are not enforced at these boundaries, but this fact can itself provide much
constraint by fixing the location of the boundary.


2.3.3 Word Boundary Independence


The logical unit for our studies is the word, as it is also the basis for our lexicons. 
Because we are interested in continuous speech recognition, we should consider more 
than intraword constraints; we should also examine phoneme constraints across word 
boundaries. As at syllable and morpheme boundaries, phoneme collocational constraints
at word boundaries can be a powerful aid to speech recognition [14].

To include cross-word constraints we need to concatenate the phoneme sequence at 
the end of one word with the phoneme sequence at the start of another. An elementary 
approach is to do this for all words. This corresponds to using a (word)* grammar, a 
grammar in which a word can be followed by any other word with equal probability. 
While it is possible to construct a sentence containing an arbitrary sequence of words, 
syntax and semantics greatly constrain the set of reasonable sentences. If we wish to 
consider phoneme sequences across word boundaries, we should account for these word 
sequence constraints. However, doing so forces us to assume a particular language 
model. 

We do not wish to rely on a language model as it would tie us to a particular 
task or recognition system. Accordingly, we have chosen not to examine inter-word 
phoneme sequences, realizing that this gives us only a partial view of the collocation 
constraints. 

Words are clearly a unit of language, but the boundaries between words are dif- 
ficult to determine in continuous speech. Therefore we should try to avoid a metric 
which relies on word boundaries. 

We could include a word boundary marker, /#/, in our pronunciations to ex- 
plicitly capture these constraints. We think doing so is somewhat arbitrary and it 
ignores other word structures. Additionally, using word boundary markers may ad- 
versely affect experiments. Some lexicons are constructed to contain many regular 
forms of a word. This produces a preponderance of sequences like /d#/, /z#/, and 
/n#/. For both of these reasons we have decided not to represent word boundaries
as a pronunciation symbol.


2.4 Clustering Technique Overview 


We next describe how to combine phonemes into classes. Our basic approach is to 
cluster phoneme symbols based on a lexical information metric. Because we do not 
know the right number of classes for lexical access, we should incorporate some means 
of creating classes using various degrees of specificity. This suggests using some type 
of hierarchical structure.


2.4.1 Many Possible Classes 


There are two fundamental ways of producing a phoneme hierarchy. The first, the 
agglomerative approach, iteratively combines classes until a single class is formed 
containing all of the phonemes. The second, the divisive approach, iteratively splits 
classes until all classes contain a unique phoneme. 

We can show that there are $2^{n-1} - 1$ ways to divide n phonemes into 2 classes.
If we have on the order of 40 phonemes, the first split in forming a binary class tree
would require evaluating roughly $2^{39} - 1 \approx 5 \times 10^{11}$ possibilities. This is far too many
for practical purposes. Instead we chose to use a pair-wise agglomerative approach
to form the hierarchy.
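
The count of two-class splits can be checked with a short Python sketch. The function names are assumptions; the brute-force enumeration is only practical for small n and serves as a sanity check on the closed form.

```python
from itertools import combinations


def two_class_splits(n):
    """Closed form: ways to divide n items into two non-empty classes."""
    return 2 ** (n - 1) - 1


def brute_force_splits(n):
    """Enumerate every unordered split of n items into two non-empty classes."""
    items = range(n)
    seen = set()
    for r in range(1, n):
        for subset in combinations(items, r):
            rest = tuple(i for i in items if i not in subset)
            # A split and its mirror image are the same unordered pair.
            seen.add(frozenset((subset, rest)))
    return len(seen)


# First divisive split for an inventory of 40 phonemes.
first_split_for_40 = two_class_splits(40)
```

For 40 phonemes the closed form gives 549,755,813,887 candidate splits, which is the figure of roughly 5 x 10^11 quoted in the text.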


2.4.2 Algorithm 


The algorithm for forming the hierarchy is simple. Initially, consider a set of n
phoneme symbols. We try each of the $\binom{n}{2}$ possible symbol pairs for clustering. For
each pair, we temporarily map the lexicon to replace occurrences of the second symbol
in the pair with the first. Then we compute the PIE resulting from this map. We
cluster the symbol pair which maximizes the PIE and permanently map the lexicon
using it. This reduces the symbol set size to n - 1. We iterate the procedure until
only a single symbol is left.
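
The procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the original implementation: it works on a table of adjacent-pair counts, maps only the following phoneme into classes when computing the PIE, and uses a toy lexicon with one symbol per phoneme. All function and variable names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(pair_counts):
    """Average mutual information in bits from (first, second) -> count."""
    total = sum(pair_counts.values())
    first, second = Counter(), Counter()
    for (a, b), c in pair_counts.items():
        first[a] += c
        second[b] += c
    return sum(
        (c / total) * log2(c * total / (first[a] * second[b]))
        for (a, b), c in pair_counts.items()
    )


def map_following(pair_counts, class_of):
    """Replace the second symbol of every pair by its class label."""
    mapped = Counter()
    for (a, b), c in pair_counts.items():
        mapped[(a, class_of[b])] += c
    return mapped


def cluster(pair_counts):
    """Greedy pair-wise agglomeration.
    Returns a list of (kept_label, absorbed_label, PIE) merges in order."""
    symbols = sorted({s for pair in pair_counts for s in pair})
    class_of = {s: s for s in symbols}
    baseline = mutual_information(pair_counts)
    merges = []
    while len(set(class_of.values())) > 1:
        classes = sorted(set(class_of.values()))
        best = None
        for keep, drop in combinations(classes, 2):
            # Temporarily merge class `drop` into class `keep`.
            trial = {s: (keep if c == drop else c) for s, c in class_of.items()}
            mapped = map_following(pair_counts, trial)
            pie = 100.0 * mutual_information(mapped) / baseline
            if best is None or pie > best[0]:
                best = (pie, keep, drop, trial)
        # Make the best merge permanent.
        pie, keep, drop, class_of = best
        merges.append((keep, drop, pie))
    return merges


# Toy lexicon (assumed), one symbol per phoneme.
pair_counts = Counter()
for pron in ["lai", "lail", "lal", "lalipap"]:
    pair_counts.update(zip(pron, pron[1:]))
merges = cluster(pair_counts)
```

The returned merge list is enough to draw a dendrogram: each entry records which two classes were joined and how much information remained afterwards.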

When there is a single symbol representing all the phonemes there is no infor- 
mation present to predict. Thus a necessary terminal condition of the clustering is 
that the PIE is zero. This contrasts with Carter, where a single-symbol mapping still
retains the information present in the number of phonemes per word.


2.4.3 Reducing Computational Complexity 


At first this algorithm seems computationally intensive because we must map the 
lexicon many times. Worse, the time needed to do so grows with the size of the 
lexicon. This presents a computational problem, since we need to use large lexicons
so as to best approximate the actual usage.

Note that the reason we do the mapping is to change the parameters of the
mutual information measure. These arguments are restricted to only a portion of 
each pronunciation. Thus we can collapse the lexicon to an m-dimensional table 
counting occurrences of m-long phoneme sequences. This table supplies values for 
estimating P(x,y). Similarly we can construct tables for the marginals P(x) and
P(y).

These tables alone are sufficient for calculating the mutual information measure. 
They also eliminate the costly lexicon mapping. We calculate the effect of merging two 
classes by summing the entries corresponding to those classes across other dimensions 
in the table. This procedure makes the algorithm independent of lexicon size except 
for the initial table generation. It depends only on the number of phonemes and the
length of the sequences.
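
The equivalence between remapping the lexicon and summing table entries can be illustrated with a short Python sketch for the two-dimensional (adjacent pair) case. The names and the toy lexicon are assumptions, and the merge is applied to both positions of the pair for simplicity.

```python
from collections import Counter


def bigram_table(lexicon):
    """Collapse a lexicon into a table of adjacent-pair counts."""
    table = Counter()
    for pron in lexicon:
        table.update(zip(pron, pron[1:]))
    return table


def merge_in_table(table, keep, drop):
    """Merge symbol `drop` into `keep` by summing the affected table entries."""
    merged = Counter()
    for (a, b), c in table.items():
        merged[(keep if a == drop else a, keep if b == drop else b)] += c
    return merged


def remap_lexicon(lexicon, keep, drop):
    """The costly alternative: rewrite every pronunciation in the lexicon."""
    return [[keep if s == drop else s for s in pron] for pron in lexicon]


# Toy lexicon (assumed), one symbol per phoneme.
toy_lexicon = [list("lai"), list("lail"), list("lal"), list("lalipap")]
table = bigram_table(toy_lexicon)
via_table = merge_in_table(table, "a", "i")
via_lexicon = bigram_table(remap_lexicon(toy_lexicon, "a", "i"))
```

Both routes give the same counts, but the table route touches only as many entries as there are distinct pairs, regardless of how many words the lexicon contains.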


2.5 Lexicon Preparation 


We base most of our studies on a modified version of the Merriam Webster Pocket 
Dictionary, which we will refer to as “MPD.” The lexicon contains roughly the 20,000 
most common words in American English. We chose this lexicon because many 
researchers have refined and checked its pronunciations. In addition, using it will 


allow us to compare our results to some earlier work. 


2.5.1 Modifying Pronunciations 


We want to be able to compare our results to those produced by distinctive features 
because their constraints for lexical access are poorly understood. Features, in the 
form of feature vectors, have difficulty representing some of the symbols used in MPD, 
notably the diphthongs and syllabic consonants. In addition, lexical stress in MPD 
is indicated both through the use of stress markers and by schwas in reduced stress 


syllables. To eliminate these problems we apply the following rewrite rules: 


fe/s/af fifofxf [e//s/ 
[W/a/alf /m/s/am/ /n/-/an/ — /n/>/an/ 
/a¥//aif  /®/-/aif  /a"/—/av/ 


No change is made to the diphthongs /i”/, /e’/, or /o”/ since their monophthong 
counterparts, /i/, /e/, and /o/, can be represented by feature vectors. In addition, 


all syllable and stress markers are removed from the pronunciations. 


2.5.2 Word Frequency Weighting 


Previous studies have shown that weighting the lexicon using frequency of occurrence 
in the Brown Corpus [15] can have dramatic effects due to the overwhelming influence 
of common words. We think of such weighting as a zeroth-order language model. 
Using any language model opens a host of issues which we could not adequately 


address in this thesis. Accordingly we chose to ignore weighting effects for clustering. 


2.6 An Example 


We will next present an example of our clustering technique so that the reader can 


better understand it. For this example we will use a subset of MPD consisting of 33 


words: 


[Table of the 33 example words with their pronunciations; legible entries include lie /lai/, lisle /lail/, loll /lal/, lollypop /lalipap/, and lye /lai/.] 


We selected these words because they form the largest set of words which can be 
formed using only four different phonemes. 

Let’s find the phoneme clusters created using the mutual information between 
adjacent clusters as the metric. There are 65 diphones in this sub-lexicon of which 
only 10 are unique. We estimate the probability of a diphone occurring from its 
frequency in the data and summarize the results in a table: 


[Table: estimated diphone probabilities $P(x_i, x_{i+1})$, each a count out of 65.] 


Using these data we can compute the desired mutual information and PIE: 


$$I(X_i; X_{i+1}) = \sum_{x_i,\,x_{i+1}} P(x_i, x_{i+1}) \log_2 \frac{P(x_i, x_{i+1})}{P(x_i)\,P(x_{i+1})} \approx 0.719 \text{ bits}$$

PIE = 100% 

Using the 4 phonemes, there are $\binom{4}{2} = 6$ ways we can form 3 classes. We try each 
of these possibilities and note which one has the greatest PIE: 


$I(\Phi(X_i); \Phi(X_{i+1})) \approx 0.272$ bits, PIE ≈ 38% 
$I(\Phi(X_i); \Phi(X_{i+1})) \approx 0.243$ bits, PIE ≈ 34% 
$I(\Phi(X_i); \Phi(X_{i+1})) \approx 0.610$ bits, PIE ≈ 85% 
$I(\Phi(X_i); \Phi(X_{i+1})) \approx 0.283$ bits, PIE ≈ 39% 
$I(\Phi(X_i); \Phi(X_{i+1})) \approx 0.297$ bits, PIE ≈ 41% 
$I(\Phi(X_i); \Phi(X_{i+1})) \approx 0.363$ bits, PIE ≈ 51% 
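
The PIE values quoted for the six candidate mergers are consistent with normalizing each class-level mutual information by the phoneme-level value of 0.719 bits. The following check is inferred from the numbers in this example and is offered as a reading aid, not as a restatement of the formal definition:

```latex
\mathrm{PIE} = 100\% \times \frac{I(\Phi(X_i);\Phi(X_{i+1}))}{I(X_i;X_{i+1})},
\qquad
\frac{0.610}{0.719} \approx 0.848 \;\Rightarrow\; \mathrm{PIE} \approx 85\%,
\qquad
\frac{0.243}{0.719} \approx 0.338 \;\Rightarrow\; \mathrm{PIE} \approx 34\%.
```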

From this we determine it is best to merge /l/ and /p/ into a single class. We 
keep this merger and consider the 3 possible sets of 2 classes: 


$I(\Phi(X_i); \Phi(X_{i+1})) \approx 0.178$ bits, PIE ≈ 25% 
PIE ≈ 22% 
$I(\Phi(X_i); \Phi(X_{i+1})) \approx 0.203$ bits, PIE ≈ 28% 


Of these possibilities, merging /a/ and /i/ produces the best results. Finally, we 


are left with only one merger possible: 


$I(\Phi(X_i); \Phi(X_{i+1})) = 0$ bits 
PIE = 0% 


Figure 2.1: Dendrogram corresponding to the example phoneme clustering (vertical axis: % Information Extracted). 


We can best understand the resulting hierarchy by displaying it in the form of a 
dendrogram [16], shown in Figure 2.1. The dendrogram’s abscissa lists the phonemes 
in our hierarchy while the ordinate shows PIE. We denote two classes joining by 
drawing a horizontal line across them at the PIE level where they merged. This 


display allows us to see both the phoneme clusters and their relative robustness. 


2.7 Summary 


We have presented a technique for forming phoneme clusters using a minimum of pre- 
sumed knowledge. In addition, we have chosen to use an information-theoretic metric 
because its utility in capturing collocational constraints has been demonstrated. We 
have shown how a metric can be defined over a phoneme sequence and have justified 


the pronunciations we use. 

Chapter 3 


Experiments 


In this chapter we present the results of our phoneme clustering experiments. We 
evaluate the performance of our clusters against those suggested by other studies 
using both historic and new lexical metrics. Finally, we examine some of the questions 
related to our techniques. 

Our experiments considered many variations on the basic theme. For clarity, we 
will present only some representative results here. Some of our additional phoneme 
clustering experiments are described in appendix A. 

In our work we have striven to avoid the preconceived notions of how to structure 
phonemes; yet, we cannot conduct nor discuss our study in a vacuum. We will always 
need to compare our results to alternate structures, and the best frame of reference 
available consists of the phoneme groupings proposed by linguists, e.g., distinctive 
features. Accordingly, we will note similarities and differences between these two 


approaches whenever appropriate. 


3.1 Diphones 


We begin by considering the use of the minimum contextual information. In our 
first study, we cluster phonemes based on the average mutual information between 
a phoneme’s class and the following phoneme’s class. Because the average mutual 
information is a reversible measure, it does not matter whether we consider a phoneme 
and its successor or its predecessor. We capture constraints in both directions at once. 
We display the results of our clustering as a dendrogram in Figure 3.1. 

Figure 3.1: Dendrogram produced by clustering based on diphones. 


Perhaps the most striking aspect of this hierarchy is that it completely segregates 
the vowels and consonants from one another. This single distinction provides a large 
amount of information, roughly 30% of what is possible. 

Vowels are organized into subclasses, and some of the divisions are suggestive of 
dimensions used to describe vowel color. For example, /i/ and /u/ are high and 
tense while /u/ and /9/ are back and rounded. 

The consonants can be viewed as being divided into four roughly-separated classes: 
semivowels, fricatives, stops, and nasals. The stops are divided based on their place of 
articulation. There may be an affinity between the coronals, demonstrated by clusters 
containing /1/ and /r/; /s/ and /n/; and /8/,/z/, /8/, [i/, [2], [€/s [8/, /4/; 
and /t/. It is interesting to note that /h/, often an oddity for other classification 
schemes, is placed among the semivowels. 

Many of the classes formed contain phonemes which are similar acoustically. This 
is fascinating, as no part of our clustering procedure requires this to be so. We 
suggest that this may be viewed as evidence that language’s acoustic and contextual 
constraints may be evolving simultaneously. 

The dendrogram shows that the robustness of the phoneme classes varies throughout 
the hierarchy. In particular, some of the initial clusters formed are not very robust 
and soon merge with other clusters. Yet these clusters are crucial in determining the 
form of the remaining tree. Perhaps some of the phonemes are grouped against 
our linguistic intuition because of these unstable initial steps. 

We can examine these critical times in the clustering process to search for slightly 
lower scoring classes which give better linguistic justification. A class which is suffi- 
ciently close in score to the best might have won under different conditions, perhaps 
a different lexicon or a divisive search. Examination of our dendrogram confirms 
this suspicion. For example, even though the /fbp/ cluster is satisfactory, linguists 
might prefer the stops to have merged first. When we examine the clustering scores, 
placing /f/ with /b/ yields 97.52 PIE while /f/ with /p/ yields 97.31 PIE and /p/ 
with /b/ yields 97.29 PIE. These are relatively close. Thus we can attribute at least 
some of the “improper” clustering to competing classes which nearly yielded the most 
information extracted but failed to do so. 

In terms of linguistic description, perhaps the worst cluster in the hierarchy is 
the fairly robust one combining /yn/ with /s/. Linguists might prefer clustering the 
nasals and placing the /s/ near /8/ or /z/. How might we explain this clustering? 
In part, it may be the aforementioned gravity of the coronals. Alternatively, it may 
be an artifact of the lexicon having many words containing /m/, /m/, or /1s/. 

The fact that such a simple process can yield a reasonable looking dendrogram is 
most encouraging. Even now we can see a fair degree of agreement between linguistic 


theory and our data-driven approach. 


3.2 Triphones 


We next add additional contextual information to see if this will result in classes better 
fitting linguistic descriptions. We do so by considering sequences of three phonemes 
rather than two. Note that these metrics incorporate directionality, unlike the metric 
used in the previous experiment. The directionality arises because we treat the phonemes 
unequally by pairing two as the “context” for the third. Thus, there are three possible 
ways to compute the average mutual information over a sequence of three phonemes, $x_1 x_2 x_3$: 

$$I(X_1; (X_2, X_3)) \qquad I(X_2; (X_1, X_3)) \qquad I(X_3; (X_1, X_2))$$

Figure 3.2: Dendrogram produced by clustering based on class in context. 


In this experiment, we cluster based on a phoneme’s class and the classes of the 
phonemes immediately left and right. This is motivated by the “phoneme in context” 
unit commonly in use, though the name “class in context” is more appropriate for 


our measure. The resulting dendrogram is shown in Figure 3.2. 
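
A minimal sketch of the class-in-context measure follows, computing the mutual information between the middle class of each triphone and the pair formed by its left and right neighbours. It assumes one character per phoneme and no word-boundary padding, neither of which is confirmed by the text; the names and toy data are hypothetical.

```python
from collections import Counter
from math import log2

def context_mi(lexicon, mapping):
    """I(X2; (X1, X3)) in bits over all triphones, after mapping to classes."""
    joint = Counter()
    for word in lexicon:
        c = [mapping[p] for p in word]
        for left, mid, right in zip(c, c[1:], c[2:]):
            joint[(mid, (left, right))] += 1
    total = sum(joint.values())
    centre, context = Counter(), Counter()
    for (mid, ctx), n in joint.items():
        centre[mid] += n
        context[ctx] += n
    return sum(n / total * log2(n * total / (centre[mid] * context[ctx]))
               for (mid, ctx), n in joint.items())

LEXICON = ["lai", "lail", "lal", "lalipap", "pil"]        # toy data
IDENTITY = {p: p for p in "ailp"}                         # one class per phoneme
CV = {"a": "V", "i": "V", "l": "C", "p": "C"}             # two broad classes
ONE = {p: "X" for p in "ailp"}                            # a single class

full = context_mi(LEXICON, IDENTITY)
broad = context_mi(LEXICON, CV)
```

Dividing `broad` by `full` gives the fraction of the phoneme-level context information that the two-class mapping retains, under the same normalization assumption as the diphone case.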


Clustering based on class in context results in clusters even more linguistically 
relevant than those of the diphone experiment. The general structure of the vowels 
is essentially the same. One improvement is that /#/ and /e/ now cluster together 


before /o/ is added. 


Among the consonants, we find /f/ clusters with fricatives rather than labial stops 
and affricates form a robust cluster. Many low-level clusters are formed by joining a 
voiced phoneme with its unvoiced counterpart. Rather than standing alone, /s/ now 


clusters with /z/ and /m/ joins them along with /n/ and /y/. 


Some changes are not for the better. We find /s/ is now grouped with the 
semivowels while /l/ is not. Nonetheless, these results are better than those of the 


diphone experiment. 

3.3 Cluster Evaluation 


We can use the hierarchies we created to partition phonemes into classes suitable 
for lexical access. We can use any set of classes from the hierarchy provided they 
encompass all of the phonemes and are mutually exclusive. Having done this we 
should ask how effective these classes are from a lexical access standpoint and compare 


them to classes suggested by other approaches. 


3.3.1 Lexical Experiments 


One way to evaluate our results is the type of procedure first used by Shipman and 
Zue [4]. They mapped a lexicon’s pronunciations in accordance with six manner 
classes and gathered the words into cohorts. They determined the efficacy of the 
classes by computing a statistic over the cohorts. 

Unlike previous lexical experiments which relied on a fixed number of broad 
classes, we can vary the number of classes used. There is a simple way to select 
n classes from a dendrogram for 1 ≤ n ≤ number of phonemes: we “slice” the dendrogram 
horizontally at the level which provides the number of classes we desire. This 
allows us to study the tradeoff between the number of classes used and the fineness 
of phonetic distinctions made. 
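
Slicing the dendrogram can be sketched as replaying the recorded mergers, in the order they were made, until the desired number of classes remains. The merge order below is the one reported for the four-phoneme example in Section 2.6; the function name and data layout are assumptions of this sketch.

```python
def slice_dendrogram(phonemes, merges, n_classes):
    """Replay merges (in clustering order) until n_classes remain."""
    if not 1 <= n_classes <= len(phonemes):
        raise ValueError("n_classes must lie between 1 and the number of phonemes")
    classes = {p: {p} for p in phonemes}
    for keep, absorb in merges:
        if len(classes) == n_classes:
            break
        classes[keep] |= classes.pop(absorb)
    return sorted(sorted(c) for c in classes.values())

phonemes = ["a", "i", "l", "p"]
# /l/ with /p/ first, then /a/ with /i/, then the final merger.
merges = [("l", "p"), ("a", "i"), ("a", "l")]

two_classes = slice_dendrogram(phonemes, merges, 2)
```

Because each merger can only lower the PIE, cutting horizontally at a given PIE level and stopping the replay at a given class count select the same partition.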

As we mentioned earlier, using word-level measures partitions the lexicon based on 
the number of phonemes per word. A measure which did not force this partitioning 
would be more appropriate for both isolated and continuous speech recognition. The 
information measures we used for clustering fulfill this requirement. Given a set 
of classes we can map the lexicon and compute the resulting PIE. By varying the 
underlying information measure, we can tune the lexical measure to suit a particular 
lexical access task. 

While more abstract than cohort-based lexical measures, we think this metric 
is still relevant as we measure the additional information needed to distinguish a 
phoneme from its classmates. Doing so is one way of viewing the lexical access 
problem. In addition, our measure uses a logarithmic scale, which Carter [7] suggests 
is more appropriate for measuring the work needed to complete the lexical access task. 
As with Carter’s metric, we can produce either a normalized result for easier comparison 
on a particular task or an absolute result for comparison across tasks. 

In our evaluation we provide both cohort-based and phoneme-based measures 
of cluster performance. The first measure we use is expected cohort size as it is 
representative of the cohort measures and seems more appropriate than mean cohort 
size. The other measure we use is the average mutual information between a class and 
its neighboring classes given in PIE form. This is precisely the second measurement 
we used for forming phoneme classes. For brevity, we refer to this measure as “context 


PIE.” The values of some additional lexical measures are given in appendix B. 
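
For readers unfamiliar with the cohort measures, the sketch below contrasts mean and expected cohort size on a toy lexicon. It assumes the usual unweighted definitions: words sharing the same class sequence form a cohort, mean cohort size is the number of words divided by the number of cohorts, and expected cohort size is the size of the cohort containing a word drawn uniformly at random. These definitions and all names are assumptions of the sketch and are not quoted from the text.

```python
from collections import Counter

def cohort_sizes(lexicon, mapping):
    """Group words by their class sequence and count each group."""
    return Counter(tuple(mapping[p] for p in word) for word in lexicon)

def mean_cohort_size(lexicon, mapping):
    cohorts = cohort_sizes(lexicon, mapping)
    return len(lexicon) / len(cohorts)

def expected_cohort_size(lexicon, mapping):
    cohorts = cohort_sizes(lexicon, mapping)
    # A cohort of size n is hit with probability n / N and contributes n.
    return sum(n * n for n in cohorts.values()) / len(lexicon)

LEXICON = ["lai", "lail", "lal", "lalipap", "pil", "pa", "pi"]  # toy data
CV = {"a": "V", "i": "V", "l": "C", "p": "C"}
IDENTITY = {p: p for p in "ailp"}

mean_size = mean_cohort_size(LEXICON, CV)
expected_size = expected_cohort_size(LEXICON, CV)
```

With the two-class mapping the seven words fall into five cohorts (sizes 1, 1, 2, 1, 2), so the mean is 7/5 while the expected size is 11/7; the expected size weights large cohorts more heavily.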


3.3.2 Baseline Establishment 


Previous experiments have shown how well a set of broad classes can disambiguate a 
word in a lexicon. However, these experiments rely on the conviction of the reader 
to evaluate both the results and the metric. We would prefer an objective baseline 
against which a classification scheme can be measured. 

One possibility is to compare the broad classification to the finest classes possible, 
the phonemes. This does not work well because we know it is possible to disambiguate 
virtually all of the lexicon using phonemes. It is similarly unreasonable to use no 
classes at all, that is, to map every phoneme to a single placeholder, though this is an 
important limit for cohort-based measures. 

There is a simple way to create a baseline for broad class evaluation. We gener- 
ate a hierarchy by combining phonemes at random. A typical dendrogram created 
this way is shown in Figure 3.3. It is important to note that this dendrogram does 
not embody the class structuring typical of dendrograms created using collocational 
constraints. We create 1000 hierarchies in this manner and compute our lexical perfor- 
mance measures using their classes. We then average the results to form the baseline 


performance. 
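
The random baseline might be generated along the following lines. This is a sketch under assumptions: pairs of classes are drawn uniformly at random, `measure` is any function that scores a list of classes, and the seed, trial count, and names are illustrative choices.

```python
import random

def random_hierarchy(phonemes, rng):
    """Merge two classes chosen uniformly at random until one class remains."""
    classes = [frozenset([p]) for p in phonemes]
    levels = [list(classes)]
    while len(classes) > 1:
        a, b = rng.sample(range(len(classes)), 2)
        merged = classes[a] | classes[b]
        classes = [c for i, c in enumerate(classes) if i not in (a, b)] + [merged]
        levels.append(list(classes))
    return levels

def baseline(phonemes, measure, n_trials=1000, seed=0):
    """Average `measure(classes)` at every class count over random hierarchies."""
    rng = random.Random(seed)
    totals = {}
    for _ in range(n_trials):
        for classes in random_hierarchy(phonemes, rng):
            k = len(classes)
            totals[k] = totals.get(k, 0.0) + measure(classes)
    return {k: total / n_trials for k, total in totals.items()}

levels = random_hierarchy(list("ailp"), random.Random(1))
# Placeholder measure for demonstration: size of the largest class.
averages = baseline(list("ailp"), lambda cs: float(max(len(c) for c in cs)), n_trials=10)
```

In practice `measure` would be expected cohort size or context PIE evaluated on the lexicon after mapping each phoneme to its class.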


3.3.3 Distinctive Features 


We would like to compare our results to the performance of distinctive features. We 
use a feature set, shown in Table 3.1, based on Stevens [17]. The primary change we 
have made to the feature set is to specify all features’ values for all phonemes. Where 

Table 3.1: Feature-bundle specifications for phonemes used in our experiments. 

Figure 3.3: A phoneme hierarchy produced by random clustering. 


Stevens left a feature unspecified, we use a “—” to indicate that the feature is not 
present. We have also replaced the features [Spread Glottis] and [Slack Vocal Cords] 
with the more common [Voiced]. 

Note that some of the features are unnecessary or redundant for our phoneme set. 
Were we to eliminate the [Nasal] feature, there would be no ambiguity amongst the 
phonemes. Other features, for example [Voiced], are critical in that their elimination 
would cause ambiguity. 

By underspecifying feature values we can form phoneme equivalence classes. In 
order to vary the scale of the equivalence classes formed, we vary the number of 
features left unspecified. We begin by ranking the features to select the best single 
feature to use. We define best as giving the greatest cluster context PIE. We keep 
the phoneme distinctions made by this feature and repeat the process to select an 
additional feature. We iterate until all features are enabled. 
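
The feature-ranking loop is a form of greedy forward selection and can be sketched as below. The three-feature bundle table is a simplified toy, not Table 3.1, and adjacent-pair mutual information stands in for context PIE to keep the example short; both substitutions, and all names, are assumptions of this sketch.

```python
from collections import Counter
from math import log2

def diphone_mi(lexicon, mapping):
    """Mutual information (bits) between adjacent classes; stand-in for context PIE."""
    joint = Counter()
    for word in lexicon:
        classes = [mapping[p] for p in word]
        joint.update(zip(classes, classes[1:]))
    total = sum(joint.values())
    left, right = Counter(), Counter()
    for (a, b), n in joint.items():
        left[a] += n
        right[b] += n
    return sum(n / total * log2(n * total / (left[a] * right[b]))
               for (a, b), n in joint.items())

def classes_from_features(bundles, enabled):
    """Phonemes with identical values on the enabled features share a class."""
    return {p: tuple(values[f] for f in enabled) for p, values in bundles.items()}

def rank_features(bundles, score):
    """Forward selection: repeatedly enable the feature that maximises `score`."""
    remaining = sorted(next(iter(bundles.values())))
    enabled, history = [], []
    while remaining:
        best = max(remaining,
                   key=lambda f: score(classes_from_features(bundles, enabled + [f])))
        remaining.remove(best)
        enabled.append(best)  # once enabled, a feature is never disabled
        history.append((best, score(classes_from_features(bundles, enabled))))
    return history

# Toy feature bundles for four phonemes (illustrative values only).
BUNDLES = {
    "a": {"syllabic": "+", "high": "-", "labial": "-"},
    "i": {"syllabic": "+", "high": "+", "labial": "-"},
    "l": {"syllabic": "-", "high": "-", "labial": "-"},
    "p": {"syllabic": "-", "high": "-", "labial": "+"},
}
LEXICON = ["lai", "lail", "lal", "lalipap", "pil", "pa", "pi"]
history = rank_features(BUNDLES, lambda m: diphone_mi(LEXICON, m))
```

Because every added feature can only refine the partition, the scores in `history` never decrease; the interesting output is the order in which features are chosen.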

This process is illustrated in Figure 3.4. The height of each bar shows the PIE for 
a particular subset of features. The numbered bottom axis measures the size of each 
feature subset. The remaining axis is divided into categories representing the features. 
If we looked from above we would see a triangular portion of the base covered. This 
is because once we enable a feature it is never disabled and we have arranged the 

Figure 3.5: Maximum information extracted by compounded feature specification. 


features in the order of their inclusion. 

This figure permits us to see how the relative importance of a feature depends on 
those already specified. For example, by examining the row corresponding to using 
a single feature, we find that [Syllabic] is the best with [Consonantal] second best. 
The two features provide redundant information. As shown in the second row, once 
we have specified [Syllabic] the importance of [Consonantal] diminishes. Instead, the 
[Tense] feature, insignificant in the one-feature ranking, is now the best one to add. 
Although it was relatively important initially, [Consonantal] will end up providing no 
useful information at all. 

This procedure forms a ranking of the features based on the phoneme identification 
information they supply. Although the information is represented in the rear diagonal 
of the previous figure, we reproduce it in Figure 3.5 for clarity. Although we can use n 

Figure 3.6: Dendrogram of a phoneme hierarchy produced using an acoustic similarity 
metric. 


features to specify $2^n$ classes, the redundancies between features as well as unrealized 


feature bundles mean that in general fewer classes will be available. 


3.3.4 Acoustic Clustering 


We also compare our results to a phoneme hierarchy produced using acoustic data. For 
this we use a method developed by Glass [18]. This technique uses a spectral average 
to represent each cluster. The most similar clusters are merged in an agglomerative 
clustering procedure. 

For our experiment, we produced a hierarchy based on phonetic transcription 
segments from 1000 TIMIT database [19] utterances. Rather than using all phonetic 
transcription characters, we reduce the set to those symbols used in our other studies. 
We include the diphthongized vowels where a monophthong is not used in transcribing 
the data. 

The resulting dendrogram is shown in Figure 3.6. We can extract phoneme classes 
from this dendrogram using the same procedure as for our other dendrograms. 

Many of the classes in this hierarchy are similar to ones produced using phoneme 
collocational data even though our experiments do not use any form of acoustic 
similarity measure. A key difference between the results is in the two cluster split: 




semivowels are more like vowels than consonants in this acoustic classification, but 


they behave more like consonants linguistically. 


3.3.5 Results 


We conducted our evaluation on the MPD lexicon, prepared exactly as for our 
phoneme clustering experiments. The results are shown in Figure 3.7. 

A lower expected cohort size is interpreted as better for lexical access. The smaller 
the expected cohort size, the fewer words we need to distinguish on average, 
and so the less we need to rely on making accurate fine phonetic distinctions. 

The performance of acoustic clusters is generally worse than that of the other 
cases; however, it is somewhat better when using only a few classes. The acoustic 
class performance is punctuated by discontinuities. Each of these may correspond to 
particularly important phoneme distinctions. 

Distinctive features are narrowly the worst choice when using fewer than 10 classes. 
This is the range most useful for lexical access with broad classification. Note that 
the expected cohort size for features drops faster than for the acoustic classes. 

Our collocational constraint-based clusters perform better than either of the aforementioned 
schemes except for the 2- and 3-class cases, and perform comparably to the 
six manner classes. The discontinuity between 8 and 9 context clusters corresponds 
to the first fully identified phoneme, /r/, splintering from its parent class. 

Unfortunately, using expected cohort size we find that our randomly formed clus- 
ters perform best, except when using fewer than 6 classes. Since we believe phoneme 
classes derived using speech knowledge should perform better than those derived ran- 
domly, we conclude that expected cohort size is a poor measure of lexical access 
difficulty. 

We expect a larger value to signify better classes when using context PIE. Larger 
values mean we have come closer to identifying phonemes in the lexicon. Presumably 
this means we are close to identifying words as well. 

We first check the performance of randomly-formed classes. These classes now 
clearly perform poorly compared to the others. Based on this alone, we have reason 


to believe this metric is superior to expected cohort size. 

Classes based on context PIE perform best regardless of how many categories 
are used. This is not surprising since the clustering procedure finds locally optimum 
classes for this mutual information measure. Were we to expand our search to seek 
the optimum set, we could provide an upper bound on class performance using this 
metric. 

Using this measure we find features perform relatively well for fewer than ten classes. 
The manner classes perform on par with distinctive features. The acoustic classes 
perform notably worse over this range, but later move to match the performance of 
features. 

The two class case is particularly interesting. Feature- and context-based clusters 
yield just over 25% of the lexicon’s information while acoustically-based clusters yield 
only about 10%. The only difference in the class pairs they use is that acoustic clusters 
place the semivowels with the vowels instead of with the consonants. The difference 


in performance must stem from this. 


3.4 Discussion 


Our study raises a number of questions we shall proceed to address. We shall take a 
closer look at the phoneme collocational data and examine how some of the decisions 


we made affect our results. 


3.4.1 Capturing Collocational Constraints 


How can we be sure the dendrograms we produced look the way they do because of 
phoneme collocational constraints? For assurance we perform a similar experiment 
in which we use no contextual information. We repeat our clustering procedure using 
PIE based on phoneme class entropy. Another way of viewing this is we base our 
clusters on phoneme frequency of occurrence alone. The resulting dendrogram is 
shown in Figure 3.8. 
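
A context-free counterpart of the metric can be sketched using class entropy alone. The normalization shown, class entropy as a percentage of phoneme entropy, is an assumption made by analogy with the mutual-information case; the names and toy data are hypothetical.

```python
from collections import Counter
from math import log2

def class_entropy(lexicon, mapping):
    """Entropy (bits) of the class distribution, ignoring all context."""
    counts = Counter(mapping[p] for word in lexicon for p in word)
    total = sum(counts.values())
    return -sum(n / total * log2(n / total) for n in counts.values())

def entropy_pie(lexicon, mapping):
    """Class entropy as a percentage of phoneme entropy (assumed normalisation)."""
    identity = {p: p for p in mapping}
    return 100.0 * class_entropy(lexicon, mapping) / class_entropy(lexicon, identity)

LEXICON = ["lai", "lail", "lal", "lalipap", "pil", "pa", "pi"]  # toy data
IDENTITY = {p: p for p in "ailp"}
CV = {"a": "V", "i": "V", "l": "C", "p": "C"}
ONE = {p: "X" for p in "ailp"}

cv_pie = entropy_pie(LEXICON, CV)
```

Since this score depends only on how often each class occurs, a clustering driven by it has no reason to respect phonotactic structure, which is the point of the control experiment.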

This dendrogram is very much unlike the dendrograms we see in our other exper- 
iments. The clusters formed do not stand out as anything linguistically relevant with 
two possible exceptions: /n/ with /t/ (which have the same place of articulation) and 
the neighborhood of /z/. Perhaps the most striking difference is that the consonants 

Figure 3.8: Dendrogram produced by clustering based on class entropy. 


and vowels are intermingled throughout this dendrogram. 
This suggests that our other experiments are exploiting phoneme collocational 


constraints and that these constraints are fairly powerful. 


3.4.2 Relationship to Pattern Recognition 


How can we gain a better understanding of the information captured by our metrics? 
An alternate view of our hierarchical clustering procedure is that it gathers phonemes 
which occur in similar contexts. In the dendrogram of Figure 3.1, phoneme classes 
are combined when the distributions of their adjacent classes are similar. Although 
we have presented the experiments from an information-theoretic standpoint, we can 
also view the work as a pattern classification problem. 

By examining these distributions we can gain a better understanding of the clus- 
tering decision process. For example, in Figures 3.9 and 3.10 we show the distribution 
of all phonemes following /b/ and /p/, and /e/ and /1/, respectively. The data is 
normalized so that the area under the curves is equal. Each figure shows two phonemes 
which are similar in terms of manner and place. 

It is important to note both the similarities and differences in these figures. Both 

Figure 3.10: Relative frequencies for phonemes following two vowels. 

of the stops share a profile dominated by vowels and the consonants /l/, /y/, /r/, 
and /s/. The phonemes following /e/ and /1/ show a radically different distribution 
from those following /b/ and /p/. 

Our diphone mutual information measure used the distributions for both preced- 
ing and succeeding classes. These distributions can differ greatly, as illustrated in 
Figures 3.11 and 3.12. The effects of the homorganic nasal-stop rule are clearly indicated 
in the first distribution. In fact, the phonemes following /ŋ/ are dominated 
strongly by /k/ and /g/. Thus looking forward there seems to be little similarity 
between the nasals. Looking backwards provides more similar profiles, and again /ŋ/ 
seems better constrained. 

Using such data we can see how phonemes with similar contexts cluster together. 
Note that our mutual information measure is based strictly on these distributional 


constraints, yet many of the clusters formed are also acoustically similar. 


3.4.3 Effects of Altering Pronunciations 


How did changing the pronunciations of the MPD lexicon affect our clustering? To 
answer this we repeated the diphone experiment using the unadulterated lexicon. We 
included all of the symbols, even the stress and syllable markers. The results are 
shown in Figure 3.13. 

Here we see numerous effects not present in the original experiment. The syllable 
markers, /</, /—/, and /*/, are clustered together as are the stress markers /’/ 
and /‘/. The nuclei of reduced syllables, all three varieties of schwa as well as the 
syllabic consonants, form a cluster. 

The differing symbol sets make it difficult to compare the phoneme clusters to 
those of the altered pronunciations. We still see similarities between the two, par- 
ticularly at higher levels in the dendrogram. We also note that some of the initial 


clusters formed are less robust than for the mapped lexicon. 


3.4.4 Lexicon Idiosyncrasies 


How are our results affected by the limitations, of both size and structure, inherent in 
the lexicon? A lexicon is produced within a set of guidelines to ensure pronunciation 

Figure 3.11: Relative frequencies for phonemes following two nasals. 

Figure 3.12: Relative frequencies for phonemes preceding two nasals. 

Figure 3.13: Dendrogram produced by clustering based on diphones. 


consistency. Our process cannot distinguish between these rules and true linguistic 
constraints. Also, our results may improve if we use a larger lexicon as it will provide 
more stable initial clusters. 

To understand these effects we repeated our trigram experiment on two additional 
lexicons. The first is the Shoup lexicon [20]. It uses cover symbols which represent 
a set of phonemes in an attempt to capture phonemic variability. The cover symbols 
work well when there is at most one in a word’s pronunciation. When multiple cover 
symbols are present, the lexicon may overgenerate pronunciations. It includes many regular forms. The 
second lexicon is the MobyPronunciator lexicon [21]. It allows multiple pronunciations 
per word but lists them explicitly. It includes only irregular word forms. The Moby 
lexicon contains foreign terms which we have removed based on the use of non-English 
characters in their orthography. 

We processed both lexicons in a manner similar to our altering of MPD. However, 
we permitted the Shoup lexicon to retain monophthongs not found in the others. 
Summary statistics for these three lexicons are shown in Table 3.2. 
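
Statistics of the kind reported in Table 3.2 can be computed directly from a list of pronunciations. The sketch below assumes one character per phoneme and uses toy data; it does not reproduce the table's values.

```python
from statistics import mean, median

def summarize(pronunciations):
    """Count, maximum, mean, and median pronunciation length in phonemes."""
    lengths = [len(p) for p in pronunciations]
    return {
        "count": len(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }

stats = summarize(["lai", "lail", "lal", "lalipap", "pil", "pa", "pi"])
```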

The results of clustering based on these lexicons are shown in Figures 3.14 and 
3.15. The Shoup lexicon produces a dendrogram which hints at the MPD results but 

Figure 3.14: Phoneme clustering based on context PIE derived from the Shoup lexicon. 

Figure 3.15: Phoneme clustering based on context PIE derived from the Moby lexicon. 


                              |    MPD |    Moby |   Shoup 
Number of Pronunciations      | 19,837 | 161,675 | 494,569 
Maximum Pronunciation Length  |     17 |      30 |      24 
Mean Pronunciation Length     |   6.52 |    8.61 |    8.67 
Median Pronunciation Length   |      6 |       8 |       8 

Table 3.2: Summary statistics for three large lexicons. 


is clearly not as good. We suspect this is an artifact of the overgeneration present 
in the lexicon. The Moby lexicon produces a hierarchy similar to the MPD results. 
Again, the clusters higher in the dendrogram have more in common than those at 


lower levels. 


3.5 Summary 


In this chapter we have shown how we generate phoneme clusters using a mutual- 
information metric. The classes are formed of phonemes that often share linguistic 
properties. We have presented the same metric as a new way of measuring the power 
of a set of classes within a broad-classification lexical access scheme. We have shown 
how we can provide both upper and lower bounds for comparing existing classification 
schemes on this scale. Finally we have examined some of the factors which affect our 


results. 

Chapter 4 


Conclusions 


Phoneme collocational constraints provide a fertile and largely unexplored area of 
research. This thesis cannot possibly address all aspects of this vast subject. In this 
chapter we will summarize our work. We will conclude by offering a few possible 


extensions of our work and describe how they might be accomplished. 


4.1 Summary of Results 


Previous studies have shown the utility of a set of broad phoneme classes for lexical 
access. Broad classes can significantly constrain word candidates from a lexicon. 
They can avoid fine acoustic distinctions and so may be detected more robustly. A 
speech recognition system can exploit these two features to improve performance 
while reducing computation. 

We demonstrated how phoneme collocational constraints can be applied to produce
a hierarchy of phoneme classes. We use mutual information as a metric because
of its success in capturing word collocational constraints. The classes we construct
are reminiscent of linguistic and acoustic classes even though we did not tap these
knowledge sources. This can be viewed as evidence of a global constraint optimization
affecting phonological, lexical, and acoustic domains.

We repeated the lexical studies of previous experiments to demonstrate the constraining
power of our phoneme classes. Because we arrange the phonemes into a
hierarchy, we can evaluate classes of varying coarseness. In these studies our results
compare favorably to those of other phoneme classification strategies.

We also have shown that the lexical metric used in these studies ranks a baseline
of random classes as being better than other techniques. This spurred us to apply
an alternate lexical metric based on phoneme collocational data. This metric ranks
random classes below others, as a good metric should.


4.2 Suggested Extensions 


There are many ways to enhance the work we have presented. Rather than provide
an exhaustive list, we will consider only those we believe are most important.


4.2.1 Word Sequence Modelling 


Researchers conducted earlier lexical studies at a time when large vocabulary continuous
speech recognition was impractical. Accordingly, these studies were geared for
isolated word systems. The questions of lexical access complexity are still relevant,
but we must shift our focus to more natural speech.

We did not address across-word phoneme constraints because we did not wish 
to introduce a language model variable into an already complex study. Adding the 
necessary data is a relatively easy task. If the phoneme sequences we are interested in 
are sufficiently shorter than the words, a bigram language model should be adequate. 
Longer phoneme sequences would require more complex language models to ensure 
accuracy. 

The results of such studies would reflect the continuous speech lexical access task
more accurately than does our current work.


4.2.2 Significance Pruning of Seed Sequences 


Some of the initial clusters in our dendrograms are not robust. The initial clusters are 
important when we are using agglomerative hierarchy construction. Because these 
classes make the finest distinctions, they are formed when data are most likely to 
be sparse. We propose using a significance test to prune entries from our sequence 
frequency table before clustering as a way of improving our classes. 
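
One possible reading of this proposal is sketched below in Python. It applies a log-likelihood-ratio (G) test of independence to the 2x2 contingency table of each diphone and keeps only the entries whose statistic reaches a threshold. The choice of test, the default threshold of 3.84 (the 5% critical value of chi-square with one degree of freedom), and the names `g_statistic` and `prune_diphones` are assumptions made for illustration; we have not committed to a particular significance test.

```python
import math
from collections import Counter


def g_statistic(k11, k12, k21, k22):
    """Log-likelihood-ratio (G) statistic for a 2x2 contingency table."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    g = 0.0
    for obs, r, c in ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1)):
        if obs > 0:  # empty cells contribute nothing to the sum
            expected = rows[r] * cols[c] / total
            g += obs * math.log(obs / expected)
    return 2.0 * g


def prune_diphones(counts, threshold=3.84):
    """Keep diphones whose count departs significantly from independence.

    counts maps (first_phoneme, second_phoneme) to a frequency.
    The threshold is an assumed default, not a value from this work.
    """
    total = sum(counts.values())
    first, second = Counter(), Counter()
    for (a, b), n in counts.items():
        first[a] += n
        second[b] += n
    kept = {}
    for (a, b), n in counts.items():
        k12 = first[a] - n            # a followed by something other than b
        k21 = second[b] - n           # b preceded by something other than a
        k22 = total - n - k12 - k21   # neither a first nor b second
        if g_statistic(n, k12, k21, k22) >= threshold:
            kept[(a, b)] = n
    return kept
```

Note that a test of this kind retains pairs that are significantly rarer than chance as well as pairs that are significantly more common, since both carry collocational information.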
4.2.3 Alternate Clustering Techniques


We have chosen to explore a single clustering technique using a single type of metric.
Our choices were motivated by previous studies. We have shown how phoneme cluster
generation can be viewed as a pattern classification problem. We can apply other
pattern classification techniques for exploring phoneme collocational constraints to
determine which is best. We can also compare the results of other distance metrics,
for example Euclidean distance between phoneme frequency contours.


4.2.4 Recognizer Tuned Clusters 


We can adapt our procedures to the performance of a particular speech recognizer. We
would use the recognizer’s lexicon and language model. We can map the lexicon using the
recognizer’s confusion matrix and insertion/deletion statistics to better simulate the
input to the lexical access component. This should give us a more realistic estimate
of broad classification power.


4.2.5 Acoustic Detectability 


Broad phoneme classes formed by our procedure are of little use to a recognizer if we 
cannot detect them reliably. We could incorporate acoustic distance measures for the 
classes into our clustering procedure, but this is not what we really seek. Instead, we 
propose building acoustic detectors for our broad classes. There are established means 
for evaluating the performance of these classifiers. Should we discover a particular 
class cannot be detected reliably, we can inhibit its formation in the classification 
tree. 

There is an alternative way we can evaluate our classes based on acoustic data.
By summing rows and columns in a particular speech recognizer’s confusion matrix,
we can approximate how our broad classes will affect that system’s performance. We
can use the entropy of the matrices to compare phoneme classification schemes.
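
The row-and-column summation can be sketched as follows. This is a minimal illustration in Python, assuming the confusion matrix is a square table of counts indexed by spoken and recognized phoneme. The function names, the `class_of` mapping, and the use of conditional entropy H(recognized | spoken) as "the entropy of the matrix" are our assumptions; other entropy measures could be substituted.

```python
import math


def collapse_confusions(confusions, phonemes, class_of):
    """Sum rows and columns of a phoneme confusion matrix into class cells.

    confusions[i][j] counts phonemes[i] spoken and phonemes[j] recognized.
    class_of maps each phoneme to its broad class label.
    """
    classes = sorted(set(class_of.values()))
    index = {c: i for i, c in enumerate(classes)}
    merged = [[0] * len(classes) for _ in classes]
    for i, spoken in enumerate(phonemes):
        for j, heard in enumerate(phonemes):
            row = index[class_of[spoken]]
            col = index[class_of[heard]]
            merged[row][col] += confusions[i][j]
    return classes, merged


def conditional_entropy(matrix):
    """H(recognized | spoken) in bits for a matrix of counts."""
    total = sum(sum(row) for row in matrix)
    h = 0.0
    for row in matrix:
        row_sum = sum(row)
        for n in row:
            if n > 0:
                h -= (n / total) * math.log2(n / row_sum)
    return h
```

Under this measure, a classification scheme that merges mutually confusable phonemes yields a collapsed matrix with lower entropy than the original, which gives one basis for comparing schemes on the same recognizer.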


4.2.6 Measuring Tree Stability 


We have used lexical measures to evaluate our phoneme classes objectively. We have
not provided an evaluation of the phoneme hierarchy itself. There may be ways we can
compare two hierarchies, perhaps based on the two halves of a lexicon, to determine
the stability of the classes.


4.3 Summary 


This thesis provides an approach for evaluating phoneme collocational constraints as
they apply to lexical access. We have shown results which demonstrate the power
of these constraints. We also have exposed a potential problem with earlier
lexical studies on broad phoneme classification. Finally, we offer support for linguistic
theories of phoneme structuring.
Appendix A


Related Clustering Experiments 


In this appendix we present some additional clustering experiments based on phoneme
collocational constraints. We will briefly describe each experiment, show the resulting
dendrogram, and discuss the outcome.


A.1 Directional Diphone Measure 


Although the mutual information measure is symmetric in its arguments, we may create a
directional measure using it by treating its arguments unequally. One way to do this for
diphones is to use a phoneme’s class and the adjacent phoneme as arguments rather than use
two classes. Thus we can allow a class to predict the following phoneme using the
measure $I(\Phi(X_i); X_{i+1})$. An alternative interpretation is that each phoneme predicts
the preceding class. By using $I(X_i; \Phi(X_{i+1}))$ we reverse the direction of the measure.
We show the results of both in Figure A.1.
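
The two directional measures can be estimated from diphone counts as in the following Python sketch. It assumes the lexicon is a list of pronunciations (each a list of phoneme symbols) and that `class_of` is a hypothetical mapping from each phoneme to its current class. The sketch computes plain, unnormalized mutual information within words only; the normalization used for our dendrograms is not reproduced here.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """I(A; B) in bits, estimated from a list of observed (a, b) pairs."""
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    n = len(pairs)
    return sum(
        (c / n) * math.log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )


def directional_measures(lexicon, class_of):
    """Return (I(class_i; phoneme_i+1), I(phoneme_i; class_i+1))."""
    forward, backward = [], []
    for pron in lexicon:
        for x, y in zip(pron, pron[1:]):
            forward.append((class_of[x], y))   # class with the following phoneme
            backward.append((x, class_of[y]))  # class with the preceding phoneme
    return mutual_information(forward), mutual_information(backward)
```

A clustering step would evaluate one of these quantities for each candidate merge of two classes and select the merge that loses the least information.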


[Dendrograms for Figure A.1 not reproduced. Axis label on both panels: % Information Extracted.]


Figure A.1: Dendrograms produced by clustering based on a class predicting the 
following phoneme (left) and the preceding phoneme (right). 
Figure A.2: Dendrogram produced by clustering based on a class predicted by the
preceding two classes. 


Neither of these hierarchies is as good as the one shown in Figure 3.1, especially
since there is not a clean split between the vowels and consonants. Both dendrograms
have many clusters based on phonemes with similar manner or place.


A.2 Forward Prediction Triphones 


Many speech recognizers use a left-to-right control strategy. A lexical access strategy
based on a phoneme and its immediate context would need to process input one
segment behind the acoustic decoder. We would like to determine if this delay is
necessary or if we could use past context only without losing constraint. To better
understand this, we used the measure $I(\Phi(X_3); (\Phi(X_1), \Phi(X_2)))$ to construct the
dendrogram shown in Figure A.2.


Again, we consider the intermingling of vowels and consonants an indication that 
using a phoneme and its neighbors provides superior performance. 
Figure A.3: Dendrogram produced by clustering based on phoneme in context using
two disjoint halves of the MPD lexicon. 


A.3 Phoneme Class Stability 


We desire a simple method for examining the stability of our phoneme clusters. By
selecting words at random, we create disjoint halves of the MPD lexicon. We repeat
our experiments using both halves and compare the results. Dendrograms for
phoneme in context clusters created using this technique are shown in Figure A.3.
The hierarchies are far from identical and yet are clearly related. Both dendrograms
separate the vowels from the consonants. Numerous phoneme pairs appear
in both dendrograms. Furthermore, we can see the “poisoning” effect an ill-fitting
phoneme can have on clustering. For an example of this, compare the placement of
/f/ with the stops in both dendrograms. Notice how the cluster containing it is far
less robust than when it is absent.
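
The split-half procedure, together with one way of putting a number on how "clearly related" two hierarchies are, can be sketched in Python as follows. The random split mirrors the procedure above. The Rand index over phoneme pairs, computed after cutting both dendrograms into the same number of classes, is an assumption we introduce for illustration; our comparison in this appendix is visual only.

```python
import random
from itertools import combinations


def split_lexicon(words, seed=0):
    """Randomly divide a lexicon into two disjoint halves."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def rand_index(classes_a, classes_b):
    """Fraction of phoneme pairs on which two class assignments agree.

    Each argument maps a phoneme to a class label. A pair agrees when both
    assignments place the two phonemes together, or both keep them apart.
    Requires at least two phonemes common to both assignments.
    """
    phonemes = sorted(set(classes_a) & set(classes_b))
    pairs = list(combinations(phonemes, 2))
    agree = sum(
        (classes_a[p] == classes_a[q]) == (classes_b[p] == classes_b[q])
        for p, q in pairs
    )
    return agree / len(pairs)
```

A value of 1.0 indicates identical partitions regardless of how the classes are labelled. Repeating the split with several seeds would indicate how much of the variation is due to the particular division of the lexicon.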


A.4 Word Frequency Weighting 


As we discussed earlier, we view weighting by word frequency as a weak attempt at 
language modelling. Still, it is traditional to perform lexical experiments with and 
without such weighting. We give the results of a single experiment using weighting 
from the Brown corpus in Figure A.4. 

Much of the dendrogram is comparable to the unweighted version. We do see some
changes, notably /0/ clustering with /u/. This is an unusual cluster but not entirely
unexpected. We know that /ð/ is common in function words, and these words are
the most abundant in the Brown corpus. It is reasonable to expect these extremely
frequent phonemes, and the phonemes that neighbor them, to behave differently.

Figure A.4: Dendrogram produced by clustering based on phoneme in context using
word frequency weighting.


A.5 Longer Range Effects 


The dendrogram produced using phoneme in context was better than the one produced
using only a single neighbor. We want to explore what will happen when we
use an even larger context. As we expand the context to include more phonemes
it becomes ever more important to use a language model to provide across-word
sequences. We do not do this. We do, however, relax the sequence constraint in hopes
of capturing more relevant events.

For these experiments, we consider the occurrence of a phoneme class in a window
preceding a selected class. Thus we use a window of length $n - 1$ for a sequence of $n$
phonemes. We show results for windows of length 2 through 5 in Figure A.5.
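
The windowed counting can be sketched in Python as follows. This is a minimal sketch under stated assumptions: windows do not cross word boundaries, only full windows are counted, and a class that occurs several times in one window is counted once. The function name and the `class_of` mapping are hypothetical.

```python
from collections import Counter


def window_cooccurrences(lexicon, class_of, n):
    """Count (class seen in preceding window, selected class) events.

    For a sequence of n phonemes, the window covers the n - 1 phonemes
    immediately preceding the selected one, with order inside the window
    ignored.
    """
    counts = Counter()
    for pron in lexicon:
        classes = [class_of[p] for p in pron]
        for i in range(n - 1, len(classes)):
            window = classes[i - (n - 1):i]
            for c in set(window):  # each class counted once per window
                counts[(c, classes[i])] += 1
    return counts
```

The resulting table can take the place of the ordered sequence counts when estimating mutual information, which is what relaxing the sequence constraint amounts to.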

It is interesting to compare dendrogram (a) in this figure to Figure A.2. These 
differ only in the enforcement of a sequence constraint. This constraint seems to 
provide much information as demonstrated by the division of vowels and consonants. 
Also, the window approach yields a dendrogram with less robust fine clusters. 
Figure A.5: Dendrograms produced by clustering based on phoneme co-occurrence
within windows of varying length. 
Using a longer window produces a better dendrogram, as expected. We believe A
Appendix B


Related Lexical Experiments 


In this appendix we give the results for four additional lexical measures of broad
classification performance. We do so because these measures are common in the literature.
We conducted all of these experiments using the MPD lexicon.

The first measure is the percentage of words in the lexicon uniquely specified by their
class patterns, shown in Figure B.1. This is considered important because it represents
words which require no further acoustic discrimination for completing lexical access.

All of the classes perform similarly except for distinctive features, which perform
decidedly worse. Randomly formed clusters generally perform best.

In our next graph, shown in Figure B.2, we compare classes using the word entropy
measure advocated by Carter [7]. Again we see a poor separation between
classification schemes, notably random classes.

We show the maximum cohort size in Figure B.3. This provides a measure of the 
most difficult word identification problem remaining after broad classification. Again 
we see the classes performing similarly except for the worse performance of features. 
Here we also see many discontinuities. These arise mainly from individual phonemes 
being identified. 

Finally, we examine mean cohort size. It is a cousin of expected cohort size, but
perhaps gives a less accurate picture of lexical access difficulty. The results for the
measure are shown in Figure B.4.

The highly skewed nature of the data means it is difficult to make detailed comparisons.
The results are similar in nature to the previous three.
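
The cohort-based measures discussed in this appendix can be computed as in the following Python sketch. It assumes an unweighted lexicon given as a mapping from words to pronunciations and a hypothetical `class_of` mapping from phonemes to broad classes. Expected cohort size is taken here to be the sum of squared cohort sizes divided by the number of words, which is the usual definition when all words are equally likely. The word entropy measure is omitted because its exact formulation is not restated in this appendix.

```python
from collections import Counter


def cohort_statistics(lexicon, class_of):
    """Summarize cohorts, the sets of words sharing one broad-class pattern.

    lexicon maps each word to its pronunciation, a sequence of phonemes.
    """
    patterns = Counter(
        tuple(class_of[p] for p in pron) for pron in lexicon.values()
    )
    sizes = list(patterns.values())
    n_words = sum(sizes)
    return {
        # Words whose class pattern is shared with no other word.
        "percent_unique": 100.0 * sum(1 for s in sizes if s == 1) / n_words,
        "max_cohort": max(sizes),
        # Average over cohorts, so small cohorts weigh as much as large ones.
        "mean_cohort": n_words / len(sizes),
        # Average over words, so each word sees the size of its own cohort.
        "expected_cohort": sum(s * s for s in sizes) / n_words,
    }
```

The difference between the last two entries illustrates why mean cohort size can understate lexical access difficulty: a few very large cohorts raise the expected size much more than they raise the mean.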
Figure B.1: Graph of phoneme class performance as measured by percentage of words
uniquely specified by their class pattern. 


[Plot for Figure B.2 not reproduced. Legend: Random Clustering, Manner Classes, Acoustic Clusters, Distinctive Features, Context Clusters. Axis labels: Word Entropy Information Extracted; Number of Classes (0 to 40).]


Figure B.2: Graph of phoneme class performance as measured by word entropy. 
Figure B.3: Graph of phoneme class performance as measured by maximum cohort
size.

Figure B.4: Graph of phoneme class performance as measured by mean cohort size.